In this project, we use the open-source R programming language to model the progression of the COVID-19 pandemic across different U.S. counties. R is maintained by an international team of developers who make the language available at The Comprehensive R Archive Network. Readers interested in reusing our code and reproducing our results should have R installed locally on their machines. R can be installed on a number of different operating systems (see Windows, Mac, and Linux for the installation instructions for these systems). We also recommend using the RStudio interface for R; the reader can download RStudio for free by following the instructions at the link. For non-R users, we recommend Hands-on Programming with R for a brief overview of the software’s functionality. Hereafter, we assume that the reader has an introductory understanding of the R programming language.
In the code chunk below, we load the packages used to support our analysis. Note that the code in this and any other code chunk can be hidden by clicking the ‘Hide’ button to facilitate navigation. The reader can hide all code and/or download the Rmd file associated with this document by clicking the Code button at the top right corner of this document.
if(require(pacman)==FALSE) install.packages("pacman") # check to see if the pacman package is installed; if not install it
if(require(devtools)==FALSE) install.packages("devtools") # check to see if the devtools package is installed; if not install it
# install these GitHub packages if they are not found locally on the machine
if(require(albersusa)==FALSE) devtools::install_github('hrbrmstr/albersusa') # install package if needed
if(require(r2d3maps)==FALSE) devtools::install_github('dreamRs/r2d3maps') # install package if needed
# check if packages are not installed; if yes, install missing packages
pacman::p_load(tidyverse, magrittr, janitor, dataPreparation, lubridate, skimr, # for data analysis
COVID19, rvest, # for extracting relevant data
DT, pander, stargazer, knitr, # for formatting and nicely printed outputs
scales, RColorBrewer, DataExplorer, tiff, grid,# for plots
plotly, albersusa, tigris, leaflet, tmap, # for maps
zoo, fpp2, NbClust, # for TS analysis and clustering
VIM, nnet, caret, # explanatory modeling
conflicted) # for managing conflicts in functions with same names
# Handling conflicting function names from packages
conflict_prefer('combine', 'dplyr') # Preferring dplyr::combine over any other package
conflict_prefer('select', "dplyr") # Preferring dplyr::select over any other package
conflict_prefer("summarize", "dplyr") # Preferring dplyr::summarize over any other package
conflict_prefer("filter", "dplyr") # Preferring filter from dplyr
conflict_prefer("dist", "stats") # Preferring dist from stats
conflict_prefer("as.dist", "stats") # Preferring as.dist from stats
# Custom Functions
source('custom_functions.R')
# Setting the seed
set.seed(2020) # to assist with reproducibility
sInfo = sessionInfo() # saving all the packages/functions and session info
For our analysis, we fuse data from multiple sources. We describe the process of obtaining and merging each of these sources in the subsections below.
In this section, we utilize the COVID19 package (Guidotti & Ardia, 2020) to obtain the following information:
- Confirmed cases, recoveries and deaths;
- Policy information (e.g., transport closing, school closing, closing event, movement restrictions, testing policies, and contact tracing); and
- Population and standard geographic information for each county.
From this information, we have also computed the new daily and weekly confirmed cases/deaths per county. The data is stored in a tidy format, but can be expanded to a wide format using pivot_wider() from the tidyverse package.
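As an illustration of the widening step mentioned above, the sketch below applies pivot_wider() to a toy tibble (hypothetical counts, not the actual counties object):

```r
# Minimal sketch: widening tidy case counts so each date becomes its own column.
# The tibble below is toy data; the real analysis stores one row per county-date.
library(tidyverse)

tidyCases <- tibble(
  id = rep(c("c1", "c2"), each = 3),
  date = rep(as.Date("2020-03-01") + 0:2, times = 2),
  confirmed = c(1, 2, 4, 0, 1, 1)
)

wideCases <- tidyCases %>%
  pivot_wider(names_from = date, values_from = confirmed)
# one row per county id, one column per date
```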
endDate = '2020-11-07'
endDatePrintV = format(ymd(endDate), format = "%b %d, %Y")
counties = covid19(country = "US",
level = 3, # for county
start = "2020-03-01", # First Sunday in March
end = endDate, # end Date
raw = FALSE, # to ensure that all counties have the same grid of dates
amr = NULL, # we are not using the apple mobility data for our analysis
gmr = NULL, # we are not using the Google mobility data for our analysis
wb = NULL, # world bank data not helpful for county level analysis
verbose = FALSE)
counties %<>% # next line removes non-contiguous US states/territories
filter(!administrative_area_level_2 %in% c('Alaska', 'Hawaii', 'Puerto Rico', 'Northern Mariana Islands', 'Virgin Islands')) %>%
fastFilterVariables(verbose = FALSE) %>% #dropping invariant columns or bijections
filter(!is.na(key_numeric)) %>% # these are not counties
group_by(id) %>% # grouping the data by the id column to make computations correct
arrange(id, date) %>% # to ensure correct calculations
mutate(day = wday(date, label = TRUE) %>% factor(ordered = F), # day of week
newCases = c(NA, diff(confirmed)), # computing new daily cases per county
newDeaths = c(NA, diff(deaths)) ) # computing new daily deaths per county
# manually identifying factor variables
factorVars = c("school_closing", "workplace_closing", "cancel_events",
"gatherings_restrictions", "transport_closing", "stay_home_restrictions",
"internal_movement_restrictions", "international_movement_restrictions",
"information_campaigns", "testing_policy", "contact_tracing")
counties %<>% # converting those variables into character and then factor
mutate_at(.vars = vars(any_of(factorVars)), .funs = as.character) %>%
mutate_at(.vars = vars(any_of(factorVars)), .funs = as.factor)
At this stage, we have only read the data obtained through the COVID19 package. The resulting data is stored in an object titled counties, which contains 783,216 observations and 19 variables. Note that we have filtered out observations that do not have a numeric key and removed some columns that do not add value to future analysis (e.g., invariant columns).
In this analysis, we have extracted the following datasets:
The CDC’s Social Vulnerability Index, where we would utilize the following four summary theme ranking variables:
a. Socioeconomic – RPL_THEME1
b. Household Composition & Disability – RPL_THEME2
c. Minority Status & Language – RPL_THEME3
d. Housing Type & Transportation – RPL_THEME4
Note that each of these indices was computed by the Centers for Disease Control and Prevention / Agency for Toxic Substances and Disease Registry / Geospatial Research, Analysis, and Services Program. The CSV file titled SVI2018_US_COUNTY.csv was downloaded on Nov 02, 2020 (click here to download the file after selecting Counties for the Geography Type and CSV File for the File Type and clicking GO).
We also extracted the following political and administrative variables, as of the beginning of the pandemic:
a. Voting results for all counties in the 2016 Presidential elections: Based on MIT Election Data and Science Lab (2018), we have obtained the 2016 elections data by county, and used the data to compute the percentage of total votes that went to President Trump, with the underlying hypothesis that the politicization of COVID response (e.g., perception/willingness to use face masks, policies and the population’s reaction to the disease) may be explained by party affiliation.
b. State Governor as of the beginning of the pandemic: Based on Wikipedia’s Table of State Governors (click for permanent link to the version we scraped), we scraped the Governor’s party affiliation since we hypothesized that it may impact the type of policies used on a state-level. Given that the District of Columbia does not have a governor, we imputed its value with “Democratic” since D.C.’s Mayor is a Democrat (and DC is not a state).
c. State’s CDC Region Classification: We have also engineered a region variable based on the CDC’s 10 Regions Framework. While geographic regions are hypothesized to be a factor in disease outbreaks, we chose to utilize the CDC regions specifically based on the following explanation from the aforementioned link: “CDC’s National Center for Chronic Disease Prevention and Health Promotion (NCCDPHP) is strengthening the consistency and quality of the guidance, communications, and technical assistance provided to states to improve coordination across our state programs.”
# (1) Obtaining the svi data
svi <- read_csv("../Data/Input/SVI2018_US_COUNTY.csv") %>%
clean_names() %>% # make everything lower_case
select(location, fips, state, county, area_sqmi, e_totpop,
rpl_theme1, rpl_theme2, rpl_theme3, rpl_theme4) %>% # columns of interest
mutate(e_popdensity = e_totpop/ area_sqmi) %>% # computing population density
filter(across(where(is.numeric), ~. >= 0)) %>% # because NA's are coded as -999
filter(!state %in% c('HAWAII', 'ALASKA')) # filtering to contiguous US
# (2) Political and Administrative Data
#### (a) 2016 Elections results on a county level
elections2016 <- read_csv("../Data/Input/countypres_2000-2016.csv") %>% # reading data
clean_names() %>%
filter(year == 2016 & party == "republican" & # data for 2016 elections as %republican votes
!state %in% c('Alaska', 'Hawaii') ) %>% # contiguous US states only
mutate(fips = str_pad(fips, width = 5, side = 'left', pad = '0'), # proper FIPS data
percRepVotes = 100*(candidatevotes/totalvotes) ) %>% # computing % republican votes (from total votes)
select(fips, percRepVotes)
#### (b) State Governor
stateGovernor <- "https://en.wikipedia.org/w/index.php?title=List_of_United_States_governors&oldid=977828843" %>%
read_html() %>% html_node("table:nth-child(9)") %>% html_table(header = 2, fill = TRUE) # scraping data
colnames(stateGovernor) <- stateGovernor[1,] %>% tolower() # lower case first row and making it colnames
stateGovernor <- stateGovernor[-1, c(1,5)] # dropping first row
stateGovernor$party %<>% recode(`Democratic–Farmer–Labor` = 'Democratic', #replacing w/ Democratic
`Republican[note 1]` = 'Republican' ) #from Republican[note 1] to Republican
stateGovernor$state %<>% toupper() # converting to upper case to match the names in the svi dataset
stateGovernor[51,] = c('DISTRICT OF COLUMBIA', 'Democratic') # since DC Mayor is a Democrat (DC is not a state)
#### (c) State Region Classification
cdcRegions = data.frame(state = c('Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island' ,
'Vermont', 'New York', # End of Region A
'Delaware', 'District of Columbia', 'Maryland', 'Pennsylvania',
'Virginia', 'West Virginia', 'New Jersey', # End of Region B
'North Carolina', 'South Carolina', 'Georgia', 'Florida', # Region C
'Kentucky', 'Tennessee', 'Alabama', 'Mississippi', # Region D
'Illinois', 'Indiana', 'Michigan', 'Minnesota', 'Ohio',
'Wisconsin', # End of Region E
'Arkansas', 'Louisiana', 'New Mexico', 'Oklahoma', 'Texas', # Region F
'Iowa', 'Kansas', 'Missouri', 'Nebraska', # Region G
'Colorado', 'Montana', 'North Dakota', 'South Dakota',
'Utah', 'Wyoming', # End of Region H
'Arizona', 'California', 'Hawaii', 'Nevada', # Region I
'Alaska', 'Idaho', 'Oregon', 'Washington' # Region J
) %>% toupper(),
region = c(rep('A', 7), rep('B', 7), rep('C', 4),
rep('D', 4), rep('E', 6), rep('F', 5),
rep('G', 4), rep('H', 6), rep('I', 4),
rep('J', 4) ) )
# (3) Combining all the potential predictors in a cross-sectional data frame
crossSectionalData <- inner_join(svi, elections2016, by = 'fips') %>%
left_join(stateGovernor, by = 'state') %>%
left_join(cdcRegions, by = 'state')
# saving the results as a RDS File
saveRDS(crossSectionalData, '../Data/Output/crossSectionalData.rds')
# Tabulating the results and providing a way to export the table to different formats
datatable(crossSectionalData %>% select(-c(state, county)),
extensions = c('FixedColumns', 'Buttons'), options = list(
dom = 'Bfrtip',
scrollX = TRUE,
buttons = c('copy', 'csv', 'excel', 'pdf'),
fixedColumns = list(leftColumns = 2))
) %>%
formatRound(columns= c('area_sqmi', 'rpl_theme1', 'rpl_theme2',
'rpl_theme3', 'rpl_theme4', 'e_popdensity', 'percRepVotes'),
digits=2)
In this section, we perform an exploratory analysis on the data obtained from the multiple sources.
noGoogleNAs <- filter(counties, !is.na(key_google_mobility)) # removing NAs from key_google_mobility
idIndex <- sample(noGoogleNAs$id, 9) # sampling 9 counties by id
# Saving the cumulative cases figure to a TIFF file
tiff(filename = '../Figures/sampleCumulativeCases.tiff',
width = 1366, height =768, pointsize = 16)
counties %>% filter(id %in% idIndex) %>%
ggplot(aes(x = date, y = confirmed, group = id, color = key_google_mobility)) +
geom_line(size = 1.25) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
facet_wrap(~ key_google_mobility, scales = 'free_y', ncol = 3) +
theme(legend.position = 'none') +
labs(color = '', x = 'Month', y = 'Cumulative Cases By County') +
scale_color_brewer(type = 'qual', palette = 'Paired')
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Creating an interactive plot for the markdown
p <- ggplot2::last_plot() + geom_line(size = 0.75) + # modifying the plot for plotly
theme_bw(base_size = 9) + theme(legend.position = 'none') # to make margins smaller
ggplotly(p, height = 768) %>% layout_ggplotly()
# Saving the cumulative deaths figure to a TIFF file
tiff(filename = '../Figures/sampleCumulativeDeaths.tiff',
width = 1366, height =768, pointsize = 16)
counties %>% filter(id %in% idIndex) %>%
ggplot(aes(x = date, y = deaths, group = id, color = key_google_mobility)) +
geom_line(size = 1.25) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
facet_wrap(~ key_google_mobility, scales = 'free_y', ncol = 3) +
theme(legend.position = 'none') +
labs(color = '', x = 'Month', y = 'Cumulative Deaths By County') +
scale_color_brewer(type = 'qual', palette = 'Paired')
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Creating an interactive plot for the markdown
p <- ggplot2::last_plot() + geom_line(size = 0.75) + # modifying the plot for plotly
theme_bw(base_size = 9) + theme(legend.position = 'none') # to make margins smaller
ggplotly(p, height = 768) %>% layout_ggplotly()
# Retrieving the U.S. county composite map as a simplefeature
cty_sf <- counties_sf("longlat") %>% filter(!state %in% c('Alaska', 'Hawaii')) # from albersusa
cty_sf %<>% geo_join(crossSectionalData, by_sp= 'fips', by_df= 'fips')
# Saving a static version of the figure as tiff
tiff(filename = '../Figures/rplTheme1Map.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('rpl_theme1', title = 'RPL Theme 1: Socioeconomic', palette = "-Greens")
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Printing a png version of the plot in Markdown (lower quality image for quicker compilation of HTML)
readTIFF("../Figures/rplTheme1Map.tiff") %>% grid.raster()
# Saving a static version of the figure (capitalizing on the tmap package)
tiff(filename = '../Figures/rplTheme2Map.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('rpl_theme2', title = 'RPL Theme 2: Household Composition & Disability', palette = "-Greens")
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Printing a png version of the plot in Markdown (lower quality image for quicker compilation of HTML)
readTIFF("../Figures/rplTheme2Map.tiff") %>% grid.raster()
# Saving a static version of the figure (capitalizing on the tmap package)
tiff(filename = '../Figures/rplTheme3Map.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('rpl_theme3', title = 'RPL Theme 3: Minority Status & Language', palette = "-Greens")
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Printing a png version of the plot in Markdown (lower quality image for quicker compilation of HTML)
readTIFF("../Figures/rplTheme3Map.tiff") %>% grid.raster()
# Saving a static version of the figure (capitalizing on the tmap package)
tiff(filename = '../Figures/rplTheme4Map.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('rpl_theme4', title = 'RPL Theme 4: Housing Type & Transportation', palette = "-Greens")
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Printing a png version of the plot in Markdown (lower quality image for quicker compilation of HTML)
readTIFF("../Figures/rplTheme4Map.tiff") %>% grid.raster()
# Saving a static version of the figure (capitalizing on the tmap package)
tiff(filename = '../Figures/percRepVotesMap.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('percRepVotes', title ='% Republican Votes', style = 'cont', palette = "div") +
tm_layout(aes.palette = list(div = c('#2166AC', '#F7F7F7', '#B2182B')))
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Printing a png version of the plot in Markdown (lower quality image for quicker compilation of HTML)
readTIFF("../Figures/percRepVotesMap.tiff") %>% grid.raster()
crossSectionalData$state %<>% tolower() %>% tools::toTitleCase()
# Retrieving the U.S. state composite map as a simplefeature
state_sf <- usa_sf("longlat") %>% filter(!name %in% c('Alaska', 'Hawaii')) # from albersusa
state_sf %<>% geo_join(crossSectionalData, by_sp= 'name', by_df= 'state')
# Saving a static version of the figure (capitalizing on the tmap package)
tiff(filename = '../Figures/cdcMap.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(state_sf) + tm_polygons('region', title = 'CDC Region', palette = "Paired")
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Printing a png version of the plot in Markdown (lower quality image for quicker compilation of HTML)
readTIFF("../Figures/cdcMap.tiff") %>% grid.raster()
# Saving a static version of the figure (capitalizing on the tmap package)
tiff(filename = '../Figures/governorsMap.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(state_sf) + tm_polygons('party', title = "Governor's Party", palette = "div") +
tm_layout(aes.palette = list(div = list('Democratic' = '#1F78B4', 'Republican' = '#E31A1C')))
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Printing a png version of the plot in Markdown (lower quality image for quicker compilation of HTML)
readTIFF("../Figures/governorsMap.tiff") %>% grid.raster()
It is important to note that, in our estimation, there are three key decisions to be made when performing time-series clustering:
Preparation of the Different Time Series to be Clustered: In this section, we have (a) selected the new daily deaths per county as the primary variable of interest, (b) smoothed that variable using a seven-day moving average, and (c) scaled each county’s 7-day moving average of new daily deaths so that it is bounded between 0 and 1. This allows us to compare the shapes of the time series across counties whose populations, and hence death counts, differ greatly in magnitude.
Choice of Distance Measure: The Euclidean distance, i.e., the \(l_2\) norm, is the most commonly used distance measure since it is computationally efficient. However, it is sensitive to noise, scale, and time shifts, and it may not be suitable for applications where the time series are of different lengths (Sardá-Espinosa, 2017).
Choice of Clustering Algorithm: A large number of clustering algorithms have been proposed in the literature; the most common approaches are shape-based, including \(k\)-means and hierarchical clustering. The reader is referred to Aghabozorgi et al. (2015) for a detailed review. In our preliminary analysis, we chose hierarchical clustering since it provides an easy-to-understand dendrogram and the number of counties was small. In our full analysis, however, we use the \(k\)-means algorithm since it is computationally efficient. Furthermore, we overcame the traditional limitation of having to pre-specify \(k\) by utilizing 26 indices for determining the optimal number of clusters, based on the approach and package implementation of Charrad et al. (2014).
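For concreteness, the \(l_2\) distance between two already-scaled series can be computed with base R’s dist(); the vectors below are hypothetical:

```r
# Toy illustration of the Euclidean (l2) distance that underpins the clustering:
# dist() on a two-row matrix returns the pairwise distance between the rows.
x <- c(0, 0.2, 0.6, 1.0)   # hypothetical scaled series for county A
y <- c(0, 0.1, 0.5, 0.9)   # hypothetical scaled series for county B
d <- as.numeric(dist(rbind(x, y)))
# d equals sqrt(sum((x - y)^2))
```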
clusteringPrep <- counties %>% # from the counties
select(id, date, key_google_mobility, newDeaths) %>% # selecting minimal amount of cols for visual inspection
arrange(id, date) %>% # arranged to ensure correct calculations
mutate(newMA7 = rollmeanr(newDeaths, k = 7, fill = NA), # 7-day ma of new (adjusted) deaths
maxMA7 = max(newMA7, na.rm = T), # obtaining the max per county to scale data
scaledNewMA7 = pmax(0, newMA7/maxMA7, na.rm = TRUE) ) %>% # scaling data to a 0-1 scale by county
select(id, key_google_mobility, date, scaledNewMA7) %>% # dropping the variable newDeaths
pivot_wider(names_from = date, values_from = scaledNewMA7) # converting the data to a wide format for clustering
constantColumns <- whichAreConstant(clusteringPrep, verbose = F) # identifying constant columns
datesDropped <- colnames(clusteringPrep)[constantColumns] # used for printing the names after the code chunk
clusteringPrep %<>% select(-all_of(constantColumns) ) %>% # speeds up clustering by dec length of series
as.data.frame() # data needs to be data frame for clustering
row.names(clusteringPrep) = clusteringPrep[,1] # needed for tsclust
clusteringPrep = clusteringPrep[,-1] # dropping the id column since it is now row.name
The following dates were removed from our data frame since the scaledNewMA7 variable was constant across all counties: 2020-03-01, 2020-03-02, 2020-03-03, 2020-03-04, 2020-03-05, 2020-03-06, and 2020-03-07.
clusteringPrep %<>% select(-c(key_google_mobility)) # removing this variable so we can cluster
nc <- NbClust(clusteringPrep, distance = "euclidean", # euclidean distance
min.nc = 2, max.nc = 50, # searching for optimal k between k=2 and k=50
method = "kmeans", # using the k-means method
index = "all") # using 26 of the 30 indices in the package
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 12 proposed 3 as the best number of clusters
## * 1 proposed 18 as the best number of clusters
## * 1 proposed 23 as the best number of clusters
## * 1 proposed 39 as the best number of clusters
## * 1 proposed 45 as the best number of clusters
## * 1 proposed 50 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
kclus <- nc$Best.partition %>% as.data.frame() %>% #obtaining the best partition/ cluster assignment for optimal k
rename(., cluster_group = .) %>% rownames_to_column("County")
#converting the wide to tall data and adding the cluster groupings
clusters <- clusteringPrep %>%
rownames_to_column(var = "County") %>%
pivot_longer(cols = starts_with("2020"), names_to = "Date") %>%
inner_join(., kclus, by = "County") %>%
mutate(cluster_group = as.factor(cluster_group))
idClusters <- clusters %>% select(c(County, cluster_group)) # creating a look-up table of county and cluster group
colnames(idClusters) <- c('id', 'cluster_group') # renaming the columns
idClusters %<>% unique() #removing the duplicates due to different dates (we had that to ensure that the clustering was applied correctly)
# Adding Cluster Grouping to a subset of the counties data frame
clusterCounties <- counties %>%
select(c(id, key_numeric, key_google_mobility, administrative_area_level_2, administrative_area_level_3)) %>%
inner_join(., idClusters, by ='id') %>%
mutate(cluster_group = paste0('C', cluster_group)) %>%
unique()
# saving the results as a RDS File
saveRDS(clusterCounties, '../Data/Output/clusterCounties.rds')
In this subsection, we provide three plots: a spaghetti plot of the scaled 7-day moving average of new deaths by cluster, a plot of the quartiles of that series by cluster, and a map of the cluster assignments across the contiguous U.S.
spaghettiDF <- counties %>% # from the counties
select(id, date, newDeaths, key_google_mobility) %>% # selecting minimal columns
left_join(clusterCounties[, c('id', 'cluster_group')], by = 'id') %>% # to get clusters
arrange(id, date) %>% # arranged to ensure correct calculations
mutate(newMA7 = rollmeanr(newDeaths, k = 7, fill = NA), # 7-day ma of new (adjusted) deaths
maxMA7 = max(newMA7, na.rm = T), # obtaining the max per county to scale data
scaledNewMA7 = pmax(0, newMA7/maxMA7, na.rm = TRUE) ) %>%
ungroup() %>% select(date, cluster_group, scaledNewMA7, key_google_mobility) %>%
group_by(date, cluster_group)
# Creating a Named Color Scale
colorPal <- brewer.pal(n= levels(spaghettiDF$cluster_group) %>% length(), 'Set2')
names(colorPal) <- levels(spaghettiDF$cluster_group)
# Saving spaghetti plot to an tiff file
tiff(filename = '../Figures/spaghettiPlot.tiff', width = 1366, height =768, pointsize = 16)
spaghettiDF %>%
ggplot(aes(x = date, y = scaledNewMA7, color = cluster_group, group = key_google_mobility)) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(size = 0.25, alpha = 0.1) +
stat_summary(aes(group = 1),
fun= median,
geom = "line",
size = 1.25, col = 'black') +
facet_wrap(~ cluster_group, ncol = 1) +
theme(legend.position = 'none') +
labs(x = 'Month', y = 'Scaled New Deaths By Cluster By Day',
caption = paste0('Solid black line represents the median for each cluster |
Based on Data from March 01, 2020 - ', endDatePrintV) ) +
scale_color_manual(values = colorPal)
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Printing a png version of the plot in Markdown (lower quality image for quicker compilation of HTML)
readTIFF("../Figures/spaghettiPlot.tiff") %>% grid.raster()
# Creating a data frame containing statistical summaries of the time series by cluster_group
summaryDf <- spaghettiDF %>%
summarise(Median = median(scaledNewMA7, na.rm= TRUE),
`First Quartile` = quantile(scaledNewMA7, probs = 0.25, na.rm= TRUE),
`Third Quartile` = quantile(scaledNewMA7, probs = 0.75, na.rm= TRUE)) %>%
pivot_longer(cols = c(`First Quartile`, Median, `Third Quartile`),
names_to = 'Statistic')
tiff(filename = '../Figures/summaryPlot.tiff', width = 1366, height =768, pointsize = 16)
summaryDf %>%
ggplot(aes(x = date, y = value, color = cluster_group, linetype = Statistic)) +
scale_x_date(date_breaks = "1 month", date_labels = "%b") +
geom_line(size = 1.25) +
scale_linetype_manual(values = c('dotted', 'solid', 'twodash')) +
facet_wrap(~ cluster_group, ncol = 1) +
theme(legend.position = 'top') +
labs(color = '', x = 'Month', y = 'Quartiles of Scaled New Deaths By Cluster By Day') +
scale_color_manual(values = colorPal)
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Printing a png version of the plot in Markdown (lower quality image for quicker compilation of HTML)
readTIFF("../Figures/summaryPlot.tiff") %>% grid.raster()
# Joining the clusterCounties results with the existing county simple features object (cty_sf)
clusterCounties$fips <- str_pad(clusterCounties$key_numeric, width = 5, side = 'left', pad = '0')
clusterCounties %<>% ungroup()
cty_sf %<>% left_join(clusterCounties[, c('fips', 'cluster_group')], by = 'fips') # adding cluster_group to cty_sf
# Creating a static visual for the paper
tiff(filename = '../Figures/clusterMap.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('cluster_group', title = 'Cluster #', palette = colorPal)
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Creating an interactive visual Using the Leaflet Package
#### Setting the Color Scheme
leafletPal <- colorFactor('Set2', domain = cty_sf$cluster_group, na.color = "white")
#### The visual
leaflet(height=500) %>% # initializing the leaflet map
setView(lng = -96, lat = 37.8, zoom = 3.8) %>% # setting the view on Continental US
addTiles() %>% # adding the default tiles
addPolygons(data = cty_sf, stroke = FALSE, fillColor = ~leafletPal(cty_sf$cluster_group), # adding the data
weight = 2, opacity = 1, color = "white", dashArray = "3", fillOpacity = 0.7, # adding color specs
popup = paste("County:", cty_sf$name, '<br>',
"Cluster #:", cty_sf$cluster_group, '<br>',
"Population Density:", round(cty_sf$e_popdensity, 1), '<br>')) %>% #pop-up Menu
addLegend(position = "bottomleft", pal = leafletPal, values = cty_sf$cluster_group,
title = "Cluster #", opacity = 1) # legend formatting
In the previous section, we showed that, using only a scaled and smoothed time series of daily deaths per county, the counties can be grouped into three clusters whose time series have distinct shapes under the Euclidean distance measure. In this section, we model the factors associated with the cluster assignment.
# Combining multiClass response with potential predictors
multiClassDF <- select(clusterCounties, fips, cluster_group) %>%
left_join(crossSectionalData, by = 'fips') %>%
select(-c(e_totpop, area_sqmi, state, county))
saveRDS(multiClassDF, '../Data/Output/multiClassDF.rds') # saving the data
skim(multiClassDF) # printing a nice summary table of the data
| Name | multiClassDF |
|---|---|
| Number of rows | 3108 |
| Number of columns | 11 |
| Column type frequency: | |
| character | 5 |
| numeric | 6 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| fips | 0 | 1 | 5 | 5 | 0 | 3108 | 0 |
| cluster_group | 0 | 1 | 2 | 2 | 0 | 3 | 0 |
| location | 2 | 1 | 16 | 42 | 0 | 3106 | 0 |
| party | 2 | 1 | 10 | 10 | 0 | 2 | 0 |
| region | 2 | 1 | 1 | 1 | 0 | 10 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| rpl_theme1 | 2 | 1 | 0.50 | 0.29 | 0.00 | 0.25 | 0.50 | 0.75 | 1.00 | ▇▇▇▇▇ |
| rpl_theme2 | 2 | 1 | 0.50 | 0.29 | 0.00 | 0.25 | 0.50 | 0.75 | 1.00 | ▇▇▇▇▇ |
| rpl_theme3 | 2 | 1 | 0.50 | 0.29 | 0.00 | 0.25 | 0.50 | 0.75 | 1.00 | ▇▇▇▇▇ |
| rpl_theme4 | 2 | 1 | 0.50 | 0.29 | 0.00 | 0.25 | 0.50 | 0.75 | 1.00 | ▇▇▇▇▇ |
| e_popdensity | 2 | 1 | 273.11 | 1813.57 | 0.15 | 17.37 | 45.26 | 118.68 | 72052.99 | ▇▁▁▁▁ |
| percRepVotes | 2 | 1 | 63.31 | 15.63 | 4.09 | 54.48 | 66.35 | 74.92 | 96.03 | ▁▁▅▇▃ |
tiff(filename = '../Figures/corrPlot.tiff',
width = 1366, height =768, pointsize = 16)
na.omit(multiClassDF) %>%
plot_correlation(ggtheme = theme_bw(), type = 'c') # compute corr among only continuous vars
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Printing a png version of the plot in Markdown (lower quality image for quicker compilation of HTML)
readTIFF("../Figures/corrPlot.tiff") %>% grid.raster()
# Saving the box plot figure to a TIFF file
tiff(filename = '../Figures/boxPlot.tiff',
width = 1366, height =768, pointsize = 16)
multiClassDF %>%
plot_boxplot(by = 'cluster_group', ncol = 2L,
ggtheme = theme_bw(),
geom_boxplot_args = list('outlier.shape' = 1))
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Printing a png version of the plot in Markdown (lower quality image for quicker compilation of HTML)
readTIFF("../Figures/boxPlot.tiff") %>% grid.raster()
df <- multiClassDF %>% select(-c(fips, location)) %>%
mutate(cluster_group = as.factor(cluster_group))
# setting the reference level to category with max frequency
df$clustReLeveled <- relevel(df$cluster_group, ref = maxCat(df$cluster_group) )
df <- df %>% select(-c(cluster_group)) # removed since we created a releveled version stored in a different variable
finalModel <- multinom(clustReLeveled ~ ., data = df) # create model
## # weights: 54 (34 variable)
## initial value 3412.289769
## iter 10 value 2279.964165
## iter 20 value 2165.730051
## iter 30 value 2135.056284
## iter 40 value 2118.518517
## final value 2117.666623
## converged
# Saving the results as a latex table, but not printing it out in the Markdown document
invisible(stargazer(finalModel, type = 'latex', p.auto = FALSE, out="../Data/Output/multi.tex",
single.row = TRUE, header = FALSE))
# tabulating model results as an HTML, which we print below
stargazer(finalModel, type = 'html', p.auto = FALSE, out="../Data/Output/multi.html", single.row = FALSE)
| | Dependent variable: | |
|---|---|---|
| | C1 | C2 |
| | (1) | (2) |
| rpl_theme1 | -0.816*** | -0.539*** |
| | (0.090) | (0.100) |
| rpl_theme2 | 0.601*** | 0.287 |
| | (0.188) | (0.182) |
| rpl_theme3 | 1.988*** | 2.493*** |
| | (0.143) | (0.117) |
| rpl_theme4 | 1.106*** | 0.633*** |
| | (0.149) | (0.142) |
| e_popdensity | 0.003*** | 0.003*** |
| | (0.0003) | (0.0003) |
| percRepVotes | -0.012*** | -0.028*** |
| | (0.003) | (0.003) |
| partyRepublican | 1.064*** | 0.377*** |
| | (0.121) | (0.132) |
| regionB | 1.901*** | -0.398** |
| | (0.184) | (0.171) |
| regionC | 3.404*** | -0.880*** |
| | (0.127) | (0.175) |
| regionD | 3.494*** | -0.228 |
| | (0.129) | (0.170) |
| regionE | 1.715*** | -0.142 |
| | (0.154) | (0.128) |
| regionF | 2.325*** | -1.708*** |
| | (0.111) | (0.159) |
| regionG | 0.889*** | -1.848*** |
| | (0.190) | (0.207) |
| regionH | 0.636*** | -1.783*** |
| | (0.235) | (0.222) |
| regionI | 3.034*** | -2.125*** |
| | (0.191) | (0.077) |
| regionJ | 2.136*** | -1.716*** |
| | (0.190) | (0.073) |
| Constant | -4.871*** | -0.873*** |
| | (0.078) | (0.089) |
| Akaike Inf. Crit. | 4,303.333 | 4,303.333 |
| Note: | \*p<0.1; \*\*p<0.05; \*\*\*p<0.01 | |
# examining how well the model performed on our entire dataset
# Recall that we are fitting an explanatory model, and not a predictive model
predictedClass <- predict(finalModel, df)
# Computing the Confusion Metrics and By Class Metrics
confMatrix <- confusionMatrix(predictedClass, df$clustReLeveled)
saveRDS(confMatrix, '../Data/Output/confMatrix.rds') # saving the data
# Printing the Resulting Tables Nicely
pander(confMatrix$table)
| | C3 | C1 | C2 |
|---|---|---|---|
| C3 | 1639 | 284 | 214 |
| C1 | 153 | 474 | 94 |
| C2 | 40 | 32 | 176 |
pander(confMatrix$byClass)
| | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value |
|---|---|---|---|---|
| Class: C3 | 0.8947 | 0.6091 | 0.767 | 0.8008 |
| Class: C1 | 0.6 | 0.8934 | 0.6574 | 0.8675 |
| Class: C2 | 0.3636 | 0.9725 | 0.7097 | 0.8922 |
| | Precision | Recall | F1 | Prevalence | Detection Rate |
|---|---|---|---|---|---|
| Class: C3 | 0.767 | 0.8947 | 0.8259 | 0.5898 | 0.5277 |
| Class: C1 | 0.6574 | 0.6 | 0.6274 | 0.2543 | 0.1526 |
| Class: C2 | 0.7097 | 0.3636 | 0.4809 | 0.1558 | 0.05666 |
| | Detection Prevalence | Balanced Accuracy |
|---|---|---|
| Class: C3 | 0.688 | 0.7519 |
| Class: C1 | 0.2321 | 0.7467 |
| Class: C2 | 0.07985 | 0.6681 |
pander(confMatrix$overall)
| Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull |
|---|---|---|---|---|
| 0.737 | 0.4968 | 0.7211 | 0.7524 | 0.5898 |
| AccuracyPValue | McnemarPValue |
|---|---|
| 5.86e-66 | 1.016e-40 |
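As a sanity check on the overall metrics above, note that both the accuracy and Cohen's kappa can be recomputed by hand from the printed confusion matrix. The short sketch below (not part of the original analysis) does exactly that: accuracy is the observed agreement on the diagonal, and kappa discounts it by the agreement expected under chance.

```r
# A minimal sketch verifying the overall metrics directly from the
# confusion matrix printed above (rows = predicted, columns = actual)
confTable <- matrix(c(1639, 284, 214,
                       153, 474,  94,
                        40,  32, 176),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(predicted = c('C3', 'C1', 'C2'),
                                    actual    = c('C3', 'C1', 'C2')))
n <- sum(confTable)
accuracy <- sum(diag(confTable)) / n                            # observed agreement
expected <- sum(rowSums(confTable) * colSums(confTable)) / n^2  # chance agreement
kappa <- (accuracy - expected) / (1 - expected)
round(c(accuracy = accuracy, kappa = kappa), 4) # accuracy = 0.737, kappa = 0.4968
```

These values match the `Accuracy` and `Kappa` entries reported by `confMatrix$overall`.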
predictedProbs <- fitted(finalModel) # computing predicted probabilities for each of the cluster outcome levels
mapResults <- cbind(na.omit(multiClassDF), predictedProbs) # col binding predProbs for Each Cluster with multiClassDF
# Finding indices to subset the data
numberOfClusters <- unique(mapResults$cluster_group) %>% as.character() %>% length()
startCol <- ncol(mapResults) - numberOfClusters + 1
endCol <- ncol(mapResults)
# Finding whether the predicted and actual clusters matched for each county
mapResults$LargestProbCluster <- colnames(mapResults[, startCol:endCol])[apply(mapResults[, startCol:endCol], 1, which.max)]
mapResults$match <- ifelse(mapResults$cluster_group == mapResults$LargestProbCluster, 'Yes', 'No') %>% as.factor()
# Retrieving the U.S. county composite map as a simple feature (since it has been overwritten)
cty_sf <- counties_sf("longlat") %>% filter(!state %in% c('Alaska', 'Hawaii')) # from albersusa
cty_sf %<>% geo_join(mapResults, by_sp= 'fips', by_df= 'fips')
# Creating a static visual for the paper
tiff(filename = '../Figures/clusterMatchMap.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('match', title = 'Cluster Match', style = 'cont', palette = "div") +
tm_layout(aes.palette = list(div = list("Yes" = "#CAB2D6", "No" = "#6A3D9A")))
invisible( dev.off() ) # to suppress the unwanted output from dev.off
# Creating an interactive visual Using the Leaflet Package
#### Setting the Color Scheme
leafletPal <- colorFactor(palette = c("#CAB2D6", "#6A3D9A"), levels = c('Yes', 'No'), na.color = "white")
#### The visual
leaflet(height=500) %>% # initializing the leaflet map
setView(lng = -96, lat = 37.8, zoom = 3.8) %>% # setting the view on Continental US
addTiles() %>% # adding the default tiles
addPolygons(data = cty_sf, stroke = FALSE, fillColor = ~leafletPal(cty_sf$match), # adding the data
weight = 2, opacity = 1, color = "white", dashArray = "3", fillOpacity = 0.7, # adding color specs
popup = paste("County:", cty_sf$name, '<br>',
"Cluster #:", cty_sf$cluster_group, '<br>',
"Cluster Predicted:", cty_sf$LargestProbCluster, '<br>',
"Cluster Match:", cty_sf$match, '<br>')) %>% #pop-up Menu
addLegend(position = "bottomleft", pal = leafletPal, values = cty_sf$match,
title = "Cluster Match", opacity = 1) # legend formatting
In this R Markdown document, we have shown that our proposed two-stage framework for modeling the smoothed and scaled time series of new daily cases can provide insights into the shape of the outbreak's time series and some of its associated factors. Specifically, we have shown that:

- At the county level, the time series of COVID-19 new daily cases can be grouped into three clusters.
- Using a multinomial regression model, we have quantified the impact of the following factors: rpl_theme1, rpl_theme2, rpl_theme3, rpl_theme4, e_popdensity, percRepVotes, party, and region on the odds of belonging to a specific cluster relative to the baseline cluster.
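To make the cluster-membership coefficients concrete: multinomial regression coefficients are log-odds, so exponentiating them yields odds ratios relative to the baseline cluster. The short illustration below (our addition, not part of the original analysis) uses the rpl_theme1 coefficient for C1 reported in the regression table.

```r
# Hedged illustration: converting a multinomial log-odds coefficient into an
# odds ratio. The value -0.816 is the rpl_theme1 coefficient for C1 above.
logOdds <- -0.816
oddsRatio <- exp(logOdds)
round(oddsRatio, 3) # 0.442: a one-unit increase in rpl_theme1 multiplies the
                    # odds of C1 membership (vs. the baseline) by roughly 0.44
```

For the fitted model object, `exp(coef(finalModel))` would produce the full matrix of odds ratios in one step.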
Aghabozorgi, S., Shirkhorshidi, A. S., & Wah, T. Y. (2015). Time-series clustering–a decade review. Information Systems, 53, 16–38.
Charrad, M., Ghazzali, N., Boiteau, V., & Niknafs, A. (2014). NbClust: An r package for determining the relevant number of clusters in a data set. Journal of Statistical Software, Articles, 61(6), 1–36. https://doi.org/10.18637/jss.v061.i06
Guidotti, E., & Ardia, D. (2020). COVID-19 data hub. Journal of Open Source Software, 5(51), 2376. https://doi.org/10.21105/joss.02376
MIT Election Data and Science Lab. (2018). County Presidential Election Returns 2000-2016 (Version V6) [Data set]. Harvard Dataverse. https://doi.org/10.7910/DVN/VOQCHQ
In this appendix, we print all the R packages used in our analysis, along with their versions, to assist readers in reproducing our results.
pander(sessionInfo(), compact = TRUE)
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale: LC_COLLATE=English_United States.1252, LC_CTYPE=English_United States.1252, LC_MONETARY=English_United States.1252, LC_NUMERIC=C and LC_TIME=English_United States.1252
attached base packages: grid, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: sf(v.0.9-6), conflicted(v.1.0.4), caret(v.6.0-86), lattice(v.0.20-41), nnet(v.7.3-14), VIM(v.6.0.0), colorspace(v.1.4-1), NbClust(v.3.0), expsmooth(v.2.3), fma(v.2.4), forecast(v.8.13), fpp2(v.2.4), zoo(v.1.8-8), tmap(v.3.2), leaflet(v.2.0.3), tigris(v.1.0), plotly(v.4.9.2.1), tiff(v.0.1-5), DataExplorer(v.0.8.1), RColorBrewer(v.1.1-2), scales(v.1.1.1), knitr(v.1.30), stargazer(v.5.2.2), pander(v.0.6.3), DT(v.0.16), rvest(v.0.3.6), xml2(v.1.3.2), COVID19(v.2.3.1), skimr(v.2.1.2), dataPreparation(v.0.4.3), progress(v.1.2.2), Matrix(v.1.2-18), lubridate(v.1.7.9), janitor(v.2.0.1), magrittr(v.1.5), forcats(v.0.5.0), stringr(v.1.4.0), dplyr(v.1.0.2), purrr(v.0.3.4), readr(v.1.4.0), tidyr(v.1.1.2), tibble(v.3.0.4), tidyverse(v.1.3.0), albersusa(v.0.4.1), devtools(v.2.3.2), usethis(v.1.6.3), pacman(v.0.5.1) and ggplot2(v.3.3.2)
loaded via a namespace (and not attached): tidyselect(v.1.1.0), htmlwidgets(v.1.5.2), ranger(v.0.12.1), maptools(v.1.0-2), pROC(v.1.16.2), munsell(v.0.5.0), codetools(v.0.2-16), units(v.0.6-7), withr(v.2.3.0), highr(v.0.8), uuid(v.0.1-4), rstudioapi(v.0.11), stats4(v.4.0.3), robustbase(v.0.93-6), vcd(v.1.4-8), TTR(v.0.24.2), labeling(v.0.4.2), repr(v.1.1.0), farver(v.2.0.3), rprojroot(v.1.3-2), vctrs(v.0.3.4), generics(v.0.1.0), ipred(v.0.9-9), xfun(v.0.19), R6(v.2.5.0), assertthat(v.0.2.1), networkD3(v.0.4), rgeos(v.0.5-5), gtable(v.0.3.0), lwgeom(v.0.2-5), processx(v.3.4.4), timeDate(v.3043.102), rlang(v.0.4.8), splines(v.4.0.3), rgdal(v.1.5-18), lazyeval(v.0.2.2), ModelMetrics(v.1.2.2.2), selectr(v.0.4-2), dichromat(v.2.0-0), broom(v.0.7.2), reshape2(v.1.4.4), yaml(v.2.2.1), abind(v.1.4-5), modelr(v.0.1.8), crosstalk(v.1.1.0.1), backports(v.1.2.0), quantmod(v.0.4.17), lava(v.1.6.8), tools(v.4.0.3), ellipsis(v.0.3.1), raster(v.3.3-13), sessioninfo(v.1.1.1), Rcpp(v.1.0.5), plyr(v.1.8.6), base64enc(v.0.1-3), classInt(v.0.4-3), ps(v.1.4.0), prettyunits(v.1.1.1), rpart(v.4.1-15), fracdiff(v.1.5-1), tmaptools(v.3.1), haven(v.2.3.1), fs(v.1.5.0), leafem(v.0.1.3), data.table(v.1.13.2), openxlsx(v.4.2.3), lmtest(v.0.9-38), reprex(v.0.3.0), pkgload(v.1.1.0), hms(v.0.5.3), evaluate(v.0.14), XML(v.3.99-0.5), rio(v.0.5.16), readxl(v.1.3.1), gridExtra(v.2.3), testthat(v.3.0.0), compiler(v.4.0.3), KernSmooth(v.2.23-18), crayon(v.1.3.4), htmltools(v.0.5.0), DBI(v.1.1.0), dbplyr(v.1.4.4), MASS(v.7.3-53), rappdirs(v.0.3.1), boot(v.1.3-25), car(v.3.0-10), cli(v.2.1.0), quadprog(v.1.5-8), gower(v.0.2.2), parallel(v.4.0.3), igraph(v.1.2.6), pkgconfig(v.2.0.3), foreign(v.0.8-80), laeken(v.0.5.1), sp(v.1.4-4), recipes(v.0.1.14), foreach(v.1.5.1), prodlim(v.2019.11.13), snakecase(v.0.11.0), callr(v.3.5.1), digest(v.0.6.27), rmarkdown(v.2.5), cellranger(v.1.1.0), leafsync(v.0.1.0), curl(v.4.3), urca(v.1.3-0), lifecycle(v.0.2.0), nlme(v.3.1-150), jsonlite(v.1.7.1), tseries(v.0.10-47), 
carData(v.3.0-4), desc(v.1.2.0), viridisLite(v.0.3.0), fansi(v.0.4.1), pillar(v.1.4.6), survival(v.3.2-7), httr(v.1.4.2), DEoptimR(v.1.0-8), pkgbuild(v.1.1.0), glue(v.1.4.2), xts(v.0.12.1), remotes(v.2.2.0), zip(v.2.1.1), png(v.0.1-7), iterators(v.1.0.13), leaflet.providers(v.1.9.0), class(v.7.3-17), stringi(v.1.5.3), blob(v.1.2.1), stars(v.0.4-3), memoise(v.1.1.0) and e1071(v.1.7-4)
Email: fmegahed@miamioh.edu | Phone: +1-513-529-4185 | Website: Miami University Official↩︎
Email: farmerl2@miamioh.edu | Phone: +1-513-529-4823 | Website: Miami University Official↩︎
Email: steve.rigdon@slu.edu | Website: Saint Louis University Official↩︎